---
title: |
  SciData Code example
author: 
    - name: John Doe
      affiliation: 
        name: "1,2,* "
authornotes:
    - note: "1"
      text: Department of Statistics
    - note: "2"
      text: Department of Government
    - note: "*"
      text: "`correspondingauthor@email.com`"
format:
  scientific-data-pdf: 
    keep-tex: true
date: last-modified
abstract: "abstract"
bibliography: bibliography.bib
editor: 
  markdown: 
    wrap: sentence
---


```{r}
#| echo: false
suppressPackageStartupMessages(
  {
    library(tidyverse)
    library(arrow)
    library(here)
  }
)
```

```{r}
#| echo: false
ds <- open_dataset(here("release"))
```

\paragraph{Reading in the Data}

We store our dataset in a parquet format. 
Parquet is a modern file storage format optimized for querying large datasets.
It is partitioned by grouping variables, and it is columnar (so that users do not need to read in an entire row to extract a value from one column).
Our dataset is prohibitively large to read and write in a plain-text format (20 Gb), but is compact  and easy to read from in parquet (700 Mb).
In R, we use the \texttt{arrow} package to query parquet files. For more information on how to read and write parquet files in R, see \url{https://r4ds.hadley.nz/arrow}. Parquet is also designed for usage in Python (\url{https://arrow.apache.org/docs/python/parquet.html}) and several other programming languages.


The following command opens the dataset.

```{r, eval=FALSE}
library(tidyverse)
library(arrow)

ds <- open_dataset("cvrs")
```

\noindent Here, \texttt{"cvrs"} indicates the path to the top-level folder containing the parquet files downloaded from Dataverse. 
Our data is organized by county, nested within states. 
After unzipping the zip file or by previewing on Dataverse, we see that \texttt{cvrs} has the following structure:

```{text}
├── state=ARIZONA
│   ├── county_name=MARICOPA
│   │   └── part-0.parquet
│   ├── county_name=PIMA
│   │   └── part-0.parquet
│   ├── county_name=SANTA%20CRUZ
│   │   └── part-0.parquet
│   └── county_name=YUMA
│       └── part-0.parquet
...
── state=UTAH
│   └── county_name=SAN%20JUAN
│       └── part-0.parquet
└── state=WISCONSIN
    ├── county_name=BROWN
    │   └── part-0.parquet
    ├── county_name=KENOSHA
    │   └── part-0.parquet
    ├── county_name=PIERCE
    │   └── part-0.parquet
    └── county_name=WAUKESHA
        └── part-0.parquet
```

Because parquet is columnar, users will find it much faster to produce summary statistics of the data. Even though the code below counts some `r round(pull(collect(count(ds)))/1e6)` million rows, it performs the count in one second on a personal laptop. 

```{r}
ds |> count(office) |> collect()
```

To perform this count, we used `count()` from `dplyr`, which totals the number of occurrences of each unique value in our `office` variable. 
We make use of R's pipe operator, `|>`, to pass our data objects forward onto subsequent operations we want to perform.  

Finally, we must use the `collect()` command from `arrow` to extract the summary. 
All transformations before `count()` are _lazily-loaded_, meaning that they are not executed until needed.
The arrow program combines the transformations internally in a way that avoids duplicative operations.

\paragraph{Extracting Summaries}

Users should use the combination of `state`, `office`, and `party` variable to identify candidates.
The code below first limits to vote choices for President in Wisconsin ballots using the `filter()` command, and counts the number of records for each candidate-party collection, sorted from most frequent to least.

```{r}
ds |> 
  filter(state == "WISCONSIN", office == "US PRESIDENT") |> 
  count(candidate, party, sort = TRUE) |> 
  collect()
```

For individual voters, use the `cvr_id` variable within a state and county.
This ID is a numeric variable that is defined within counties.
These numbers do not in any way indicate the time in which the ballot was cast, or the personal identity of the voter.
The following code extracts the vote from the voter marked with the `cvr_id` of 1. 

```{r}
ds |> 
  filter(state == "ARIZONA", county_name == "MARICOPA") |> 
  filter(cvr_id == 1) |> 
  select(county_name, cvr_id, office, district, candidate, party) |> 
  collect()
```

This example shows that this voter split their ticket, voting for Democrats in the Presidential and Congressional race, while voting for one Republican candidate in State Senate.
However, further investigation into this voter's State Senate district shows that it was uncontested. That is, with the following query,

```{r}
ds |> 
  filter(state == "ARIZONA", office == "STATE SENATE", district == "013") |> 
  count(candidate, party_detailed) |> 
  collect()
```

\noindent we see that none of the ballots in State Senate district 13 were for a Democrat candidate, indicating that no Democrat ran in this district.

\paragraph{Application: Biden and Trump's Party Loyalty}

As our main exercise, we ask whether partisans --- defined by their votes for Congress and state legislature --- vote for their party's presidential candidate.
Donald Trump was a polarizing candidate.
Election observers have wondered if Trump drew less support from Republican-leaning voters compared to the support his opponent, Joe Biden, drew from Democratic-leaning voters. 
Some referred to these group of voters as ``Never Trump Republicans.''

For this analysis, we study the counties in five battleground states which together decided the election: Wisconsin, Michigan, Georgia, Arizona, and Nevada. 
Biden won Georgia, Arizona, and Wisconsin by less than a percentage point, and won Nevada and Michigan by less than 3 percentage points.

```{r}
ds_states <- ds |> 
  filter(state %in% c("WISCONSIN", "MICHIGAN", "GEORGIA", "ARIZONA", "NEVADA"))
```

Recall that aggregate election results report how many votes Biden and Trump received, but unlike cast vote records, they do not reveal which of those votes came from Republicans and Democratic voters.
Only cast vote records can classify voters into partisan types based on how they voted in all offices except President.

We first need to narrow down our data so that we only use voter-contest pairs in contests contested by a Democrat and a Republican.
In other words, the voter needed to have a choice to vote for a Republican or Democrat.

```{r}
ds_contested <- ds_states |> 
  collect() |> 
  # Contested contests
  filter(any(party == "REP") & any(party == "DEM"), 
         .by = c(state, office, district)) |> 
  # Ballots with Presidential vote
  filter(any(office == "US PRESIDENT"), 
         .by = c(state, county_name, cvr_id))
```

The first `filter()` command in this code limits to vote choices for contested offices.
For each state-office-district combination, we examine if there are any Republican candidates *and* any Democrats.
Contests that do not meet this criteria are dropped.
The second `filter()` command limits to ballots with a Presidential choice.
This excludes fragmented ballots where the President and the rest of the ballot is separated.
Both commands are done after `collect()` because the `arrow` package does not support group-specific filter commands as of version 16.1.0.

```{r}
#| echo: false
#| eval: false
## This is 3x faster because we can use arrow verbs
tictoc::tic()
rows_contes <- ds_states |> 
  summarize(
    contested = any(party == "REP") & 
      any(party == "DEM"), 
    .by = c(state, office, district)
  ) |> 
  filter(contested)

rows_has_pres <- ds_states |> 
  summarize(
    has_pres = any(office == "US PRESIDENT"), 
     .by = c(state, county_name, cvr_id)
  ) |> 
  filter(has_pres)

ds_contested <- ds_states |> 
  semi_join(rows_contes) |> 
  semi_join(rows_has_pres) |> 
  collect()

tictoc::toc()
```


We now construct a dataset where each row is a single voter.
We first create a dataset of Presidential votes:

```{r}
## Voters based on President
ds_pres <- ds_contested |> 
  filter(office == "US PRESIDENT") |> 
  select(
    state, county_name, 
    cvr_id, candidate,
    pres_party = party) |> 
  mutate(pres = case_when(
    pres_party == "REP" ~ "Trump", 
    pres_party == "DEM" ~ "Biden", 
    pres_party == "LBT" ~ "Libertarian", 
    candidate == "UNDERVOTE" ~ "Undervote",
    .default = "Other"))
```

Separately, we construct a dataset that classifies the same voters based on their non-Presidential vote choice.
The variable `nonpres_party` is `Down-ballot Democrat` if the voter only votes for Democrats down-ballot (using the `all()` command) and it is `Down-ballot Republican` if the voter only votes for Republicans down-ballot.

```{r}
## subset to all-Dem voters based on everything except President
ds_D <- ds_contested |> 
  filter(office != "US PRESIDENT") |> 
  filter(all(party == "DEM"), .by = c(state, county_name, cvr_id)) |> 
  distinct(state, county_name, cvr_id) |> 
  mutate(nonpres_party = "Down-ballot Democrat")

## same subset, but for all-Rep voters
ds_R <- ds_contested |> 
  filter(office != "US PRESIDENT") |> 
  filter(all(party == "REP"), .by = c(state, county_name, cvr_id)) |> 
  distinct(state, county_name, cvr_id) |> 
  mutate(nonpres_party = "Down-ballot Republican")

```

Now we join voter's choices for President with their down-ballot choices.
Because each row is now a single voter, we join one-to-one using `state`, `county_name`, and `cvr_id`.
Voters who were not classified into Democrats or Republican, are, by construction, those who voted for some Democratic down-ballot candidates and Republican down-ballot candidates, or undervoted in some of these offices. 
We label them `Mixed`.

```{r}
ds_analysis <- ds_pres |> 
  left_join(
    bind_rows(ds_D, ds_R), 
    by = c("state", "county_name", "cvr_id"), relationship = "one-to-one") |> 
  mutate(nonpres_party = replace_na(nonpres_party, "Mixed"))
```

\noindent Finally, we construct a cross-tabulation of this dataset using the base-R `xtabs()` function.

```{r}
xtabs(~ nonpres_party + pres, ds_analysis) |> 
  addmargins()
```

This table shows for example that among `r format(nrow(filter(ds_analysis, nonpres_party == "Down-ballot Democrat")), big.mark = ",")` solidly Democratic voters, `r format(nrow(filter(ds_analysis, nonpres_party == "Down-ballot Democrat", pres == "Biden")), big.mark = ",")` voted for Joe Biden.
We can show cell counts in terms of proportions of the entire row, with the following base-R operation:
```{r}
xtprop <- xtabs(~ nonpres_party + pres, ds_analysis) |> 
  prop.table(margin = 1) |> 
  round(3) 

## add margins
row_Ns <- xtabs(~ nonpres_party, ds_analysis) |> 
  format(big.mark = ",")

## reorder columns and append totals
xtprop[, c("Biden", "Trump", "Libertarian", "Undervote")] |> 
  cbind(row_Ns) |> 
  kableExtra::kbl(format = "latex", booktabs = TRUE)

```

\bigskip

This formatted table shows more clearly that the ticket splitting rate among solid partisans was on the order of 1 percent in this sample. 
Such small subgroups are almost impossible to detect in a survey. 
In contrast, 97 percent of solid Republicans stuck with their party's nominee, Trump, and 98 percent of solid Democrats stuck with Biden.
Trump's party loyalty was a percentage point smaller than Biden's.

A starker difference arises in the mixed group (those who vote for some Republicans and some Democrats, or undervoted, down-ballot).
Biden won this group of weak partisans by more than 10 points. 
Close to 5 percent of this group voted for the third-party Libertarian candidate for President, instead of picking either Biden or Trump. 
Undervoting for President was low, less than 1 percent, in these group of voters. 

More can be done to examine if these results vary by state, county, or precinct.
Future versions of this dataset can also include ballot measures and local candidates that give more context of these patterns.
